Negative Binomial Regression

STA6235: Modeling in Regression

Introduction

  • Recall the Poisson regression model:

\ln\left( y \right) = \beta_0 + \beta_1 x_1 + ... + \beta_k x_k

  • Today, we will discuss an assumption of the Poisson distribution, and what to do if we break that assumption.

Today’s Data

  • candidatevotes: votes received by this candidate for this particular party

  • totalvotes: total number of votes cast for this election

  • year: year in which election was held

  • region: region that state belongs to, as defined by the US Census Bureau

    • I derived this; please examine the code to see how.
library(tidyverse)
library(fastDummies)

house <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-11-07/house.csv') %>%
  filter(stage == "GEN" &
         party %in% c("REPUBLICAN", "DEMOCRAT") &
         state != "DISTRICT OF COLUMBIA") %>%
  mutate(candidatevotes = candidatevotes + 1,
         year10 = year/10,
         region = if_else(state %in% c("CONNECTICUT", "MAINE", "MASSACHUSETTS", "NEW HAMPSHIRE", "RHODE ISLAND", "VERMONT", "NEW JERSEY", "NEW YORK", "PENNSYLVANIA"), "Northeast",
                          if_else(state %in% c("ILLINOIS", "INDIANA", "MICHIGAN", "OHIO", "WISCONSIN", "IOWA", "KANSAS", "MINNESOTA", "MISSOURI", "NEBRASKA", "NORTH DAKOTA", "SOUTH DAKOTA"), "Midwest",
                                  if_else(state %in% c("DELAWARE", "FLORIDA", "GEORGIA", "MARYLAND", "NORTH CAROLINA", "SOUTH CAROLINA", "VIRGINIA","WEST VIRGINIA", "ALABAMA", "KENTUCKY", "MISSISSIPPI", "TENNESSEE", "ARKANSAS", "LOUISIANA", "OKLAHOMA", "TEXAS"), "South", "West")))) %>%
  dummy_cols(select_columns = c("region"))

Poisson Distribution

  • The Poisson probability mass function is as follows:

f(y; \lambda) = \frac{\lambda^y e^{-\lambda}}{y!},

  • where y is the observed count (y \in \{0, 1, 2, \ldots\}) and \lambda > 0.

  • A requirement of this distribution is that the mean (\lambda) equals the variance (\lambda).

  • To check this, we just find the mean and variance of the count outcome we are looking at.
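As a quick sanity check of the mean–variance property, we can compute both moments directly from the Poisson pmf with base R's dpois() (a standalone sketch; the value of \lambda here is arbitrary):

```r
# Compute the mean and variance of a Poisson(4) distribution from its pmf.
# Both should equal lambda = 4 (up to negligible truncation error).
lambda <- 4
y_vals <- 0:200                            # truncate; tail mass beyond 200 is negligible
pmf    <- dpois(y_vals, lambda)
mean_y <- sum(y_vals * pmf)                # E[Y]
var_y  <- sum((y_vals - mean_y)^2 * pmf)   # Var[Y]
c(mean = mean_y, variance = var_y)         # both approximately 4
```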

Example

  • Let’s check the mean and variance of our outcome, candidatevotes.
house %>% summarize(mean(candidatevotes), var(candidatevotes))
😱
  • Poisson regression should definitely not be used!

Negative Binomial Regression

  • Let us now consider negative binomial regression:

\ln\left( y \right) = \beta_0 + \beta_1 x_1 + ... + \beta_k x_k

  • What is different? The underlying distribution.

  • We now are applying the negative binomial distribution:

f(y; r, p) = {y + r - 1 \choose y} p^r (1-p)^y

  • where y is the number of failures observed before the r^{th} success and p is the probability of success.
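As a check, this pmf matches base R's dnbinom(), which uses the same parameterization (y failures before the r^{th} success, success probability p); the values of r and p below are arbitrary:

```r
# Compare the pmf written by hand against base R's dnbinom().
r <- 3
p <- 0.4
y <- 0:10
manual <- choose(y + r - 1, y) * p^r * (1 - p)^y
cbind(y, manual, dnbinom = dnbinom(y, size = r, prob = p))  # columns agree
```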

Negative Binomial Regression

  • Why does the negative binomial handle overdispersed data better?

  • The variance is as follows,

\text{var}[Y] = \mu + \frac{\mu^2}{k},

  • Here, k is an additional parameter, called the dispersion parameter.

  • As k \to \infty, \text{NB}(\mu, k) \to \text{Poi}(\mu).

    • That is, as the dispersion parameter increases, the negative binomial converges in distribution to the Poisson.
  • What does this mean for us?

    • If we apply the negative binomial when the Poisson was appropriate, we will get essentially the same results.
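We can see this convergence numerically with base R's mean-parameterized dnbinom(), whose `size` argument plays the role of the dispersion parameter k (the mean \mu = 5 below is arbitrary):

```r
# As the dispersion parameter (`size`) grows, the negative binomial pmf
# approaches the Poisson pmf with the same mean.
mu <- 5
y  <- 0:30
for (k in c(1, 10, 1000)) {
  cat("size =", k, " max |NB - Poisson| =",
      max(abs(dnbinom(y, size = k, mu = mu) - dpois(y, mu))), "\n")
}
```

The maximum pointwise difference shrinks toward zero as `size` increases.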

Negative Binomial Regression

  • Why do we care if the data are overdispersed?

  • When the data are overdispersed, the Poisson model underestimates the standard errors.

  • Let’s think about how test statistics are created:

z_i = \frac{\hat{\beta}_i}{\text{SE}_{\hat{\beta}_i}}

  • As the standard error shrinks, our test statistic increases.

  • As our test statistic increases, our p-value decreases.

  • We are inflating the type I error rate when applying the Poisson to overdispersed data!
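A small simulation (with made-up data, not today's election data) illustrates this: we generate overdispersed counts, then fit both models and compare the standard errors on the slope.

```r
# Simulate overdispersed counts (negative binomial with theta = 2, so
# var = mu + mu^2/2 > mu), then fit Poisson and negative binomial models.
set.seed(6235)
n  <- 2000
x  <- rnorm(n)
mu <- exp(1 + 0.5 * x)
y  <- MASS::rnegbin(n, mu = mu, theta = 2)

m_pois <- glm(y ~ x, family = poisson)
m_nb   <- MASS::glm.nb(y ~ x)

c(SE_poisson = summary(m_pois)$coefficients["x", "Std. Error"],
  SE_negbin  = summary(m_nb)$coefficients["x", "Std. Error"])
# The Poisson SE is noticeably smaller, so its p-values are too optimistic.
```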

R Syntax

  • We will now use the glm.nb() function from the MASS package to specify the negative binomial distribution.

    • !! WARNING !! loading the MASS package masks the select() function from dplyr (tidyverse).

    • I prefer to call it as MASS::glm.nb() rather than loading the MASS package.

m <- MASS::glm.nb(outcome ~ var_1 + var_2 + ... + var_k, 
            data = dataset)
  • Note that we do not need to specify the distribution inside the function!

    • This function is only for negative binomial.

Example

  • Let’s reproduce the example from the Poisson lecture.

  • We will use the election data to model the number of votes received as a function of the year, the region, the interaction between the party and the year, and the interaction between the party and region.

m <- MASS::glm.nb(candidatevotes ~ year10 +  party + region_Midwest + region_Northeast + region_West + 
           party:year10 + 
           party:region_Midwest + party:region_Northeast + party:region_West + totalvotes,
         data = house)
summary(m)

Call:
MASS::glm.nb(formula = candidatevotes ~ year10 + party + region_Midwest + 
    region_Northeast + region_West + party:year10 + party:region_Midwest + 
    party:region_Northeast + party:region_West + totalvotes, 
    data = house, init.theta = 3.044428116, link = log)

Coefficients:
                                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)                       1.611e+01  9.049e-01  17.801  < 2e-16 ***
year10                           -2.807e-02  4.549e-03  -6.172 6.76e-10 ***
partyREPUBLICAN                  -6.225e+00  1.184e+00  -5.260 1.44e-07 ***
region_Midwest                    3.022e-02  1.567e-02   1.928   0.0538 .  
region_Northeast                  9.109e-02  1.619e-02   5.625 1.86e-08 ***
region_West                       7.021e-02  1.580e-02   4.444 8.83e-06 ***
totalvotes                        4.770e-06  6.079e-08  78.466  < 2e-16 ***
year10:partyREPUBLICAN            3.127e-02  5.916e-03   5.286 1.25e-07 ***
partyREPUBLICAN:region_Midwest   -3.984e-02  2.191e-02  -1.818   0.0690 .  
partyREPUBLICAN:region_Northeast -2.792e-01  2.318e-02 -12.044  < 2e-16 ***
partyREPUBLICAN:region_West      -1.634e-01  2.250e-02  -7.261 3.83e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(3.0444) family taken to be 1)

    Null deviance: 28139  on 19566  degrees of freedom
Residual deviance: 20959  on 19556  degrees of freedom
AIC: 479769

Number of Fisher Scoring iterations: 1

              Theta:  3.0444 
          Std. Err.:  0.0298 

 2 x log-likelihood:  -479745.1550 

\begin{align*} \ln(\hat{y}) =& 16.11 - 0.03 \text{ year} - 6.22 \text{ repub.} + 0.03 \text{ midwest} + 0.09 \text{ northeast} + 0.07 \text{ west} \\ & + 0.03 (\text{repub. $\times$ year}) \\ & - 0.04 (\text{repub. $\times$ midwest}) - 0.28 (\text{repub. $\times$ northeast}) - 0.16 (\text{repub. $\times$ west}) \end{align*}
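The coefficients are on the log scale, so exponentiating gives multiplicative effects on the expected count. For example, for Democrats, each additional decade multiplies expected votes by exp(-0.02807), holding all else constant:

```r
# Multiplicative effect of one decade (year10) for Democrats,
# using the coefficient from the summary() output above.
exp(-0.02807)   # about 0.97, i.e., roughly a 3% decrease per decade
```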

Inference - Confidence Intervals

  • We again find confidence intervals as we did before, using the confint() function.
confint(m)
                                         2.5 %        97.5 %
(Intercept)                       1.431317e+01  1.790104e+01
year10                           -3.710638e-02 -1.904309e-02
partyREPUBLICAN                  -8.566580e+00 -3.883095e+00
region_Midwest                   -4.307423e-04  6.091404e-02
region_Northeast                  5.945774e-02  1.227981e-01
region_West                       3.925256e-02  1.012285e-01
totalvotes                        4.631954e-06  4.907906e-06
year10:partyREPUBLICAN            1.956598e-02  4.297417e-02
partyREPUBLICAN:region_Midwest   -8.282722e-02  3.149976e-03
partyREPUBLICAN:region_Northeast -3.247801e-01 -2.336813e-01
partyREPUBLICAN:region_West      -2.075389e-01 -1.191787e-01
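The confidence limits are also on the log scale; exponentiating them (e.g., exp(confint(m))) gives limits for the multiplicative effect. Using the year10 bounds above:

```r
# Exponentiate the 95% CI for year10 (Democrats) from the confint() output.
exp(c(-3.710638e-02, -1.904309e-02))
# roughly 0.964 to 0.981: a 2% to 4% decrease in expected votes per decade
```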

Inference - Hypothesis Testing

  • We take the same approach to determining significance as before.

  • A single term in the model \to summary().

  • Multiple terms in the model \to full/reduced anova().

    • This is required for a categorical variable with three or more categories.

Example

  • Is the interaction between party and year significant?
summary(m)

Call:
MASS::glm.nb(formula = candidatevotes ~ year10 + party + region_Midwest + 
    region_Northeast + region_West + party:year10 + party:region_Midwest + 
    party:region_Northeast + party:region_West + totalvotes, 
    data = house, init.theta = 3.044428116, link = log)

Coefficients:
                                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)                       1.611e+01  9.049e-01  17.801  < 2e-16 ***
year10                           -2.807e-02  4.549e-03  -6.172 6.76e-10 ***
partyREPUBLICAN                  -6.225e+00  1.184e+00  -5.260 1.44e-07 ***
region_Midwest                    3.022e-02  1.567e-02   1.928   0.0538 .  
region_Northeast                  9.109e-02  1.619e-02   5.625 1.86e-08 ***
region_West                       7.021e-02  1.580e-02   4.444 8.83e-06 ***
totalvotes                        4.770e-06  6.079e-08  78.466  < 2e-16 ***
year10:partyREPUBLICAN            3.127e-02  5.916e-03   5.286 1.25e-07 ***
partyREPUBLICAN:region_Midwest   -3.984e-02  2.191e-02  -1.818   0.0690 .  
partyREPUBLICAN:region_Northeast -2.792e-01  2.318e-02 -12.044  < 2e-16 ***
partyREPUBLICAN:region_West      -1.634e-01  2.250e-02  -7.261 3.83e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(3.0444) family taken to be 1)

    Null deviance: 28139  on 19566  degrees of freedom
Residual deviance: 20959  on 19556  degrees of freedom
AIC: 479769

Number of Fisher Scoring iterations: 1

              Theta:  3.0444 
          Std. Err.:  0.0298 

 2 x log-likelihood:  -479745.1550 
  • Yes! p < 0.001.

Example

  • Is the interaction between party and region significant?
f <- MASS::glm.nb(candidatevotes ~ year10 +  party + region_Midwest + region_Northeast + region_West + party:year10 + party:region_Midwest + party:region_Northeast + party:region_West + totalvotes, data = house)
r <- MASS::glm.nb(candidatevotes ~ year10 +  party + region_Midwest + region_Northeast + region_West + party:year10 + totalvotes, data = house)
anova(r, f, test = "Chisq")
  • Yes! p < 0.001.

Comparison

  • Poisson model:

\begin{align*} \ln(\hat{y}) =& -2.62 + 0.07 \text{ year} - 4.40 \text{ repub.} + 0.11 \text{ midwest} + 0.12 \text{ northeast} + 0.11 \text{ west} \\ & + 0.02 (\text{repub. $\times$ year}) \\ & - 0.06 (\text{repub. $\times$ midwest}) - 0.31 (\text{repub. $\times$ northeast}) - 0.19 (\text{repub. $\times$ west}) \end{align*}

  • Negative binomial model:

\begin{align*} \ln(\hat{y}) =& 16.11 - 0.03 \text{ year} - 6.22 \text{ repub.} + 0.03 \text{ midwest} + 0.09 \text{ northeast} + 0.07 \text{ west} \\ & + 0.03 (\text{repub. $\times$ year}) \\ & - 0.04 (\text{repub. $\times$ midwest}) - 0.28 (\text{repub. $\times$ northeast}) - 0.16 (\text{repub. $\times$ west}) \end{align*}

Wrap Up

  • We now have learned modeling for count data.

    • Mean = variance \to Poisson.

    • Variance >>> mean (overdispersion) \to negative binomial.

  • Note that there is a method to handle excessive zeros in the model: zero-inflated Poisson (ZIP) and zero-inflated negative binomial (ZINB).

  • In today’s assignment, you will remove the outlier, reanalyze the data, and compare the results.